Explore User Habits for Shared Bikes in San Francisco

Investigation Overview

Bike sharing has become increasingly popular among residents and tourists in big cities such as New York City, Chicago, and San Francisco. The Ford GoBike data I'm exploring today contains trip records from San Francisco during February 2019. The dataset records 183,412 trips from both subscribers and non-subscribers, along with the duration of each trip and the start and end stations with their latitude and longitude. In the age of global warming, biking has regained its popularity among commuters like me. In this investigation, I want to explore user habits in San Francisco's shared bike system to see whether people take advantage of it for commuting.

Dataset Overview

The data consists of 183,412 bike ride records with 17 features (duration, start/end date and time, start/end station name and id, start/end station location, bike id, subscription type, user birth year, and user gender). Most variables are categorical, although they are stored as numeric data types. The records cover February 2019 in the city of San Francisco. Data are missing across categories, but mostly in users' personal information.
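
To make the missing-data remark concrete, here is a minimal sketch of how missing values per column could be counted with pandas. The toy rows and the column names `member_birth_year` and `member_gender` are assumptions for illustration only; they are not taken from the notebook's dataframe.

```python
import pandas as pd

# Toy rows, made up for illustration (not real records from the dataset).
# The column names for the personal information are assumed.
toy = pd.DataFrame({
    "duration_sec": [523, 1200, 310],
    "member_birth_year": [1985.0, None, None],
    "member_gender": ["Male", None, "Female"],
})

# Count missing values per column and list the worst columns first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

On the real data, the same call would be applied to the full dataframe (`bike` in the cells below) to see which columns need cleaning.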

What does the distribution of trip durations look like?

Trip duration has a long-tailed distribution: only 1.5% of rides last longer than 2700 sec (45 min), which is the rental limit for subscribers (30 min for non-subscribers). Most trips last around 500 sec (8 min), which seems like a reasonable amount of time for commuting or running errands.

In [42]:
# use bins covering roughly 60-2700 sec so the long tail is cut off
plt.figure(figsize=[ 11.69, 8.27])

bin_edges = np.arange(60, 2700, 100)
sb.distplot(bike['duration_sec'], bins = bin_edges, kde = False,
            hist_kws = {'alpha' : 1})
plt.xlabel('Duration (sec)')
plt.ylabel('Count')
plt.title('Trip Durations', y=1.04, fontsize=14, weight='bold')

plt.show()
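
The share of rides above a duration limit can be checked directly rather than read off the histogram. The following is a small sketch under stated assumptions: the helper `share_longer_than` is a hypothetical name, and the toy durations are made up, so the printed share is not the 1.5% reported for the real data.

```python
import pandas as pd

def share_longer_than(durations: pd.Series, limit_sec: int) -> float:
    """Return the fraction of trips whose duration exceeds limit_sec."""
    return float((durations > limit_sec).mean())

# Toy durations in seconds, made up for illustration
toy = pd.Series([300, 480, 520, 610, 900, 1500, 2800, 3600])

share = share_longer_than(toy, 2700)
print(f"{share:.1%} of the toy trips exceed 45 min")
```

Applied to `bike['duration_sec']`, the same helper would give the share of trips beyond the 45 min limit.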

Do users rent more frequently during commute hours?

The graph of rental start times suggests that most trips start around 8:00 and 17:00, which are the hours when most people travel to and from work.

In [41]:
plt.figure(figsize=[ 11.69, 8.27])

base_color = sb.color_palette()[0]
sb.countplot(data=bike, x='start_hour', color=base_color);
plt.xlabel('Rental Start Time (:00)')
plt.ylabel('Rental Counts')
plt.title('Rental Start Time Frequency',y=1.04, fontsize=14, weight='bold')

plt.show()
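
The `start_hour` column used in the plot is a derived feature. A minimal sketch of one way to derive it is shown here, assuming the raw timestamp lives in a column named `start_time` and is stored as a string; both the column name and the toy rows are assumptions.

```python
import pandas as pd

# Toy rows, made up for illustration; `start_time` is an assumed column name
toy = pd.DataFrame({
    "start_time": [
        "2019-02-28 08:05:12",
        "2019-02-28 08:47:03",
        "2019-02-28 17:32:10",
        "2019-02-28 12:13:13",
    ]
})

# Parse the strings into timestamps, then keep only the hour of day (0-23)
toy["start_time"] = pd.to_datetime(toy["start_time"])
toy["start_hour"] = toy["start_time"].dt.hour

# Number of trips starting in each hour, ordered by hour
hour_counts = toy["start_hour"].value_counts().sort_index()
print(hour_counts)
```

The resulting counts are the same quantity that `sb.countplot` draws as bars.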

Do subscribers tend to ride longer given the benefit of an extra 15 min?

We have data for 163.5k subscribers and 19.9k non-subscribers. Most users ride for 12-13 min regardless of whether they are subscribers. Surprisingly, subscribers on average took shorter rides than non-subscribers.

In [40]:
plt.figure(figsize=[ 11.69, 8.27])

ax2=sb.violinplot(data=bike, x='user_type', y='duration_sec', color=base_color);
ax2.set(ylim=(0, 2700), yticks = [480, 1020, 1500, 1980, 2520], yticklabels= ['8min', '17min', '25min', '33min', '42min'], xticklabels = ['non-Subscribers','Subscribers'])
ax2.set(ylabel='Duration', xlabel='User Type')
plt.title('Trip Duration Distribution for non-Subscriber and Subscriber users',y=1.04, fontsize=14, weight='bold')
plt.show()
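
Because the durations are long-tailed, it helps to compare both the mean and the median per user type alongside the violin plot. This is a sketch on made-up rows; the `user_type` values `Subscriber` and `Customer` are assumed labels, and the numbers are not results from the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration; the user_type labels are assumed
toy = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Customer"],
    "duration_sec": [400, 600, 800, 700, 1500],
})

# Trip count, mean, and median duration for each user type
summary = toy.groupby("user_type")["duration_sec"].agg(["count", "mean", "median"])
print(summary)
```

The median is less sensitive to a few very long rides, so a gap between mean and median within a group hints at how much the tail drives the average.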

Geographic features of the Stations

  • Where are the rental stations located?
  • Where are the top 10 most popular start stations located?
  • How long do the trips last for the top 10 stations? Is there any pattern?
  • Where do most of the trips end for the top 10 start stations?

Where are the rental stations located?

San Francisco's shared bike system has three centers: downtown San Francisco, Oakland, and Silicon Valley (including San Jose).

In [21]:
# plot the start station location on the map
m=folium.Map([37.550108, -122.265746], zoom_start=8)
hm_wide = HeatMap(
    list(zip(bike.start_station_latitude.values, bike.start_station_longitude.values)),
    min_opacity=0.2,
    radius=5, 
    blur=5, 
    max_zoom=1,
)

# plot heatmap
loc = 'Trip Stations in San Francisco'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   



m.add_child(hm_wide)
m.get_root().html.add_child(folium.Element(title_html))

m
Out[21]:
[Interactive folium heatmap: Trip Stations in San Francisco]

The 10 most popular start stations are all in the downtown San Francisco area.

In [22]:
# explore top 10 start location and duration
# extract the names of the 10 most popular start stations (by trip count) and pull the related rows from the original dataframe
start_10=bike.start_station_name.value_counts().index.tolist()
start_10=start_10[0 : 10] 
bike_10 = bike.loc[bike['start_station_name'].isin(start_10)]

# where are the most popular stations located 
m2=folium.Map([37.791852, -122.423597], zoom_start=12)
hm_wide = HeatMap(
    list(zip(bike_10.start_station_latitude.values, bike_10.start_station_longitude.values)),
    min_opacity=0.2,
    radius=13, 
    blur=10, 
    max_zoom=1,
)

# plot heatmap
loc = 'Top 10 Start Stations Are Located in Downtown San Francisco'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   



m2.add_child(hm_wide)
m2.get_root().html.add_child(folium.Element(title_html))

m2
Out[22]:
[Interactive folium heatmap: Top 10 Start Stations Are Located in Downtown San Francisco]

How long do the trips last for the top 10 stations? Is there any pattern?

After zooming in on durations of 0-45 min and 0-30 min, I observed that trips starting from the top 10 stations mostly last 6-15 min, with an average trip duration of around 10 min. Trips starting from the San Francisco Ferry Building station seem to last longer than those from the other popular stations.

In [39]:
# the duration has a long tail, explore the duration time below 45 min (2700 sec) and below 30 min (1800 sec)

fig = plt.figure(figsize = [11.69, 8.27])
base_color = sb.color_palette()[0]

plt.subplot(1,2,1)
ax1=sb.boxplot(data=bike_10, y='start_station_name', x='duration_sec', color=base_color, order=start_10)
ax1.set(xlim=(0, 2700), xticks = [480, 1020, 1500, 1980, 2520], xticklabels= ['8min', '17min', '25min', '33min', '42min'], 
        ylabel = 'Top 10 Popular Start Station (Top 1 to 10)', title = 'Zoom in to 0-45 min Trip Duration', xlabel = None)

plt.subplot(1,2,2)
ax2=sb.boxplot(data=bike_10, y='start_station_name', x='duration_sec', color=base_color, order=start_10)
ax2.set(xlim=(0, 1800), xticks = [360, 720, 1080, 1440, 1800], xticklabels= ['6min', '12min', '18min', '24min', '30min'], 
        yticklabels = [], ylabel=None, title = 'Zoom in to 0-30 min Trip Duration', xlabel=None)

fig.text(0.5, -0.02, 'Duration (min)', ha='center', va='center')
fig.suptitle("Trip Durations in Top 10 Start Stations", y = 1.04, fontsize = 14, weight = "bold")

plt.tight_layout();

plt.show()

Where do most of the trips end for the top 10 start stations?

For the 10 most popular start stations, most trips (80%, with durations of 6-15 min) end at stations spread across the downtown San Francisco area.

In [23]:
# Further investigate the short trips. Note: this filter keeps 180-900 sec, which is 3-15 min;
# a strict 6-15 min window would need a lower bound of 360 sec.
bike_10_615 = bike_10.loc[(bike_10['duration_sec'] > 179 ) & (bike_10['duration_sec'] < 901)]

# where do most trips from the 10 most popular start stations end
m3=folium.Map([37.791852, -122.423597], zoom_start=12)
hm_wide = HeatMap(
    list(zip(bike_10_615.end_station_latitude.values, bike_10_615.end_station_longitude.values)),
    min_opacity=0.2,
    radius=13, 
    blur=10, 
    max_zoom=1,
)


# plot heatmap
loc = 'Popular End Stations for the Top 10 Start Stations'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   



m3.add_child(hm_wide)
m3.get_root().html.add_child(folium.Element(title_html))

m3
Out[23]:
[Interactive folium heatmap: Popular End Stations for the Top 10 Start Stations]
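
The share of trips inside a duration window depends on getting the minute-to-second conversion right (6 min is 360 sec, 15 min is 900 sec). Here is a minimal sketch that does the conversion in one place; `share_in_window` is a hypothetical helper and the durations are made up, so the outputs are not the 80% figure from the real data.

```python
import pandas as pd

def share_in_window(durations: pd.Series, low_min: float, high_min: float) -> float:
    """Fraction of trips lasting between low_min and high_min minutes, inclusive."""
    low_sec, high_sec = low_min * 60, high_min * 60
    return float(durations.between(low_sec, high_sec).mean())

# Toy durations in seconds, made up for illustration
toy = pd.Series([200, 360, 500, 700, 900, 1200, 2400, 100, 450, 880])

share_6_15 = share_in_window(toy, 6, 15)   # window of 360-900 sec
share_3_15 = share_in_window(toy, 3, 15)   # window of 180-900 sec
print(f"6-15 min: {share_6_15:.0%}, 3-15 min: {share_3_15:.0%}")
```

Comparing the two windows shows how much the reported share can shift with the lower bound, which is worth checking before quoting a percentage.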

When do the trips usually start for the top 10 stations?

Looking further at the rental start times, most of these trips start around 8:00 and 17:00, matching typical commute hours.

In [43]:
# Take a look at when these trips usually happen
plt.figure(figsize=[ 11.69, 8.27])

sb.countplot(data=bike_10_615, x='start_hour', color=base_color)
plt.xlabel('Rental Start Time (:00)')
plt.ylabel('Rental Counts')
plt.title('Rental Start Time for 80% of Trips From the Top 10 Stations', y=1.04, fontsize=14, weight='bold')
plt.show()

How does the distribution of trip start times differ across the top 10 start stations?

Although the sample as a whole suggests that the top 10 stations are particularly popular during the usual commute hours, namely 8:00 and 17:00, each station is different. Market St & 10th St is popular throughout the daytime, suggesting that the station might be located near a popular tourist site or among office buildings. The Caltrain stations (Caltrain Station and Caltrain Station 2) are mostly popular at 5:00 and 14:00, perhaps because people combine train and bike for their commute.

12:00, 13:00, and 14:00 are also popular times to unlock bikes; these trips may be for going to lunch.

In [65]:
time=bike_10['start_hour'].sort_values().drop_duplicates().to_list()

# explore relationship among time, duration, and the top 10 start stations
g = sb.FacetGrid(data = bike_10, col = 'start_station_name', height = 8.27/4, aspect = (14.70/3)/(8.27/4), 
                 col_wrap=3, col_order = start_10, sharex=False)
g.map(plt.scatter, 'start_hour', 'duration_sec', alpha=1/20)
plt.subplots_adjust(top=0.9)

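# set xticks together with xticklabels so that each hour label lines up with its tick position
g.set(ylim=(0, 1800), yticks = [360, 720, 1080, 1440, 1800], yticklabels= ['6min', '12min', '18min', '24min', '30min'], xticks=time, xticklabels=time, xlabel = None, ylabel=None)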
plt.setp(g.fig.texts, text="")
g.set_titles(row_template="{row_name}", col_template="{col_name}")

plt.tight_layout();

g.fig.suptitle('Trip Start Time & Duration distribution for Top 10 Start Stations (in ranking order)', y=1.04, fontsize=14, weight='bold')
g.fig.text(0.5, -0.01, 'Start Time', ha='center', va='center')
g.fig.text(-0.01, 0.5, 'Duration (min)', va='center', rotation='vertical')

plt.show()
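
The per-station pattern can also be tabulated instead of read from the scatter plots. The sketch below counts trips by station and start hour and picks each station's busiest hour. The station names `A` and `B` and all rows are made up for illustration.

```python
import pandas as pd

# Toy rows, made up for illustration; station names are placeholders
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "B", "B"],
    "start_hour": [8, 8, 17, 12, 13, 13, 13],
})

# Rows are stations, columns are start hours, cells are trip counts
hourly = pd.crosstab(toy["start_station_name"], toy["start_hour"])

# For each station, the hour with the most trip starts
peak_hour = hourly.idxmax(axis=1)
print(hourly)
print(peak_hour)
```

On the real data, the same table built from `bike_10` would give a numeric check on which hours each of the top 10 stations is busiest.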